Feature Engineering Bookcamp by Sinan Ozdemir

Author: Sinan Ozdemir
Language: eng
Format: epub
Publisher: Simon & Schuster
Published: 2022-08-24T22:00:00+00:00


Stemmed

Stripped of any stop words

And in this case our resulting list of tokens is

['wait', 'plane']
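
To make those steps concrete, here is a minimal sketch of a tokenizer that lowercases, drops stop words, and stems. It is not the stem_tokenizer used in our listings: the tiny stop-word set, the suffix-stripping toy_stem helper, and the example sentence are all hypothetical stand-ins, chosen so the sketch runs with only the standard library. A real pipeline would more likely rely on a library stemmer and a full stop-word list.

```python
import re

# Assumption: a tiny illustrative stop-word set, not the list used in the book.
STOP_WORDS = {"a", "an", "the", "on", "for", "is", "are", "i", "my", "to", "of"}


def toy_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_stem_tokenizer(text):
    """Lowercase the text, split it into word tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


# Hypothetical input sentence, picked to mirror the token list shown above.
tokens = toy_stem_tokenizer("Waiting on the plane")
print(tokens)  # ['wait', 'plane']
```

Any callable with this shape (a string in, a list of token strings out) can be passed as the tokenizer parameter of TfidfVectorizer.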

We can now use this custom tokenizer by setting our TfidfVectorizer’s tokenizer parameter, as seen in listing 5.16. Note that because our tokenizer will lowercase and remove stop words for us, we won’t need to grid search for these parameters.

Listing 5.16 Using our custom tokenizer

ml_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(tokenizer=stem_tokenizer)),  # ❶
    ('classifier', clf)
])

params = {
    # 'vectorizer__lowercase': [True, False],
    # 'vectorizer__stop_words': [],  ❷
    'vectorizer__max_features': [100, 1000, 5000],
    'vectorizer__ngram_range': [(1, 1), (1, 3)],
    'classifier__C': [1e-1, 1e0, 1e1]
}

print("Stemming + Log Reg\n=====================")
advanced_grid_search(
    # remove cleaning
    train['text'], train['sentiment'], test['text'], test['sentiment'],
    ml_pipeline, params
)

❶ Using a custom tokenizer

❷ Not needed anymore, as our tokenizer is removing stop words and is lowercasing
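
For a sense of how much work the remaining grid represents, we can enumerate it directly. This is only a sketch: it assumes advanced_grid_search tries every combination exhaustively, which this excerpt does not confirm, and it counts parameter settings only, not cross-validation folds.

```python
from itertools import product

# The three parameters still being searched in listing 5.16.
params = {
    'vectorizer__max_features': [100, 1000, 5000],
    'vectorizer__ngram_range': [(1, 1), (1, 3)],
    'classifier__C': [1e-1, 1e0, 1e1],
}

# Build one dict per combination of parameter values.
names = list(params)
combinations = [dict(zip(names, values)) for values in product(*params.values())]

print(len(combinations))  # 3 * 2 * 3 = 18
print(combinations[0])
```

Every option list we leave in the grid multiplies this count by its length, which is why letting the tokenizer settle lowercasing and stop-word removal keeps the search smaller.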

Our results (figure 5.19) show a reduction in performance, like we saw with our text cleaning.


